In [1]:
In [2]:
user_id gender age income customer score
0 1 Male 55.0 112000 11
1 3 Male 68.0 70000 13
2 5 Female 58.0 51000 5
3 8 Female 62.0 71000 4
4 12 Female 40.0 71000 6
Out[2]:
user_id age income customer score
count 12950.000000 12950.000000 12950.000000 12950.000000
mean 7423.445869 53.380011 65329.266409 16.633668
std 4274.906366 17.079656 21583.713051 16.770312
min 1.000000 18.000000 30000.000000 4.000000
25% 3744.250000 41.250000 49000.000000 6.000000
50% 7392.500000 55.000000 63000.000000 10.000000
75% 11128.750000 66.000000 79000.000000 20.000000
max 14824.000000 90.000000 120000.000000 100.000000

Plotting Gender

In [6]:

After plotting gender we can see that there isn't a direct relation we could use to segment the customers with this feature. However, we can note interesting patterns, such as the relation between Income and Score or between Age and Score, so we will choose to focus on those features.
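A quick way to check this numerically is to compare per-gender means: if the averages are similar, gender alone will not separate the customers. This is a minimal sketch using a small synthetic stand-in for the dataset (the real CSV is not shown here); column names follow the head() output above.

```python
import pandas as pd

# Synthetic stand-in for the customer dataset; not the real data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "income": [112000, 51000, 70000, 71000, 65000, 71000],
    "customer score": [11, 5, 13, 4, 9, 6],
})

# Similar per-gender means would suggest gender is not a useful
# segmentation feature on its own.
by_gender = df.groupby("gender")[["income", "customer score"]].mean()
print(by_gender)
```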

In [7]:
In [9]:

Elbow Curve

We have created several K-Means models with different numbers of clusters (1-10). We have to choose the number of clusters such that:

  • The sum of squared distances of samples to their closest cluster center (inertia_) is as small as possible.
  • The number of clusters is as large as practical. More clusters allow us to identify more specific groups of customers.
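The elbow-curve loop above can be sketched as follows, using a synthetic stand-in for the (income, score) feature matrix since the real data is not shown here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled (income, score) features.
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

# Fit one K-Means model per candidate k and record its inertia
# (sum of squared distances to the closest cluster center).
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the "elbow" is the k where
# the decrease flattens out.
```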
In [10]:
[Plot: K-Means - Elbow Curve (inertia vs. number of clusters)]

Looking at the plot, it seems that the best choice is 3-5 clusters, just before the drop in inertia flattens out. (We can't know for sure that this is the best way of labeling the data, since this is an unsupervised problem.)

We hope to segment and categorize the customers by the following characteristics:

  • Label 0 is low income and low spending
  • Label 1 is high income and high spending
  • Label 2 is mid income and mid spending
  • Label 3 is high income and low spending
  • Label 4 is low income and high spending
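Fitting the 5-cluster model and reading off the labels can be sketched as below. Note that K-Means cluster ids are arbitrary: the mapping from each id to a business label ("low income / low spending", etc.) is read off the cluster centroids after fitting. Synthetic data stands in for the real features here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic (income, score)-like data with 5 blobs.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_              # cluster id (0-4) per customer
centers = kmeans.cluster_centers_    # one (income, score) centroid per cluster

# Inspecting `centers` tells us which id corresponds to, e.g.,
# "high income and low spending".
```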
In [11]:
In [12]:
[Plot: Clustering - K-Means model with 3, 4 and 5 clusters (Income vs. Score)]
In [16]:
[Plot: Clustering - Attributes per cluster (Income and Score per label)]

Hierarchical Clustering

Agglomerative Hierarchical Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

AgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity matrix, but it is computationally expensive when no connectivity constraints are added between samples: it considers all the possible merges at each step.

This is a type of clustering that requires two main inputs:

  • n_clusters: Number of clusters or centroids to generate.
  • linkage: Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations; the algorithm merges the pairs of clusters that minimize this criterion:
      • ‘ward’ minimizes the variance of the clusters being merged.
      • ‘average’ uses the average of the distances of each observation of the two sets.
      • ‘complete’ or ‘maximum’ linkage uses the maximum distance between all observations of the two sets.
      • ‘single’ uses the minimum of the distances between all observations of the two sets.
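With those two inputs, the fit can be sketched as follows (synthetic data stands in for the real features; ward linkage matches the run in the next cell):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

# Ward linkage merges, at each step, the pair of clusters that yields the
# smallest increase in total within-cluster variance.
agg = AgglomerativeClustering(n_clusters=5, linkage="ward")
agg_labels = agg.fit_predict(X)
```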
In [14]:
[Plot: AgglomerativeClustering vs. K-Means (Income vs. Customer Score)]

Agglomerative Hierarchical Clustering - Dendrogram

It’s possible to visualize the tree representing the hierarchical merging of clusters as a dendrogram. Visual inspection can often be useful for understanding the structure of the data, though more so in the case of small sample sizes.

It’s possible to get the distance matrix that contains the distance from each point to every other point of the dataset; it will be useful as input to the linkage step. Using the hierarchy.linkage function, we'll be able to perform hierarchical/agglomerative clustering, and hierarchy.fcluster can then cut the hierarchical clustering into a flat clustering, providing the flat cluster id of each observation.
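The linkage-and-cut steps above can be sketched as below, again on synthetic data:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=5, random_state=42)

# Build the linkage matrix: each row records one merge
# (cluster a, cluster b, merge distance, size of the new cluster).
Z = linkage(X, method="ward")

# Cut the tree into at most 5 flat clusters (ids start at 1).
flat_labels = fcluster(Z, t=5, criterion="maxclust")
```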

In [*]:

Density Based Clustering (DBSCAN)

K-means, hierarchical and fuzzy clustering perform really well on unsupervised data; however, for tasks with arbitrarily shaped clusters, or clusters within clusters, those techniques might perform poorly (because elements in the same cluster might not share similarities).

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped.

The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples).

The whole idea is that if a particular point belongs to a cluster, it should be near lots of other points in that cluster. Note: Density = Number of points within a specified radius.

DBSCAN uses two different parameters:

  • Epsilon: Determines a specified radius; if it contains enough points, we call it a dense area.
  • minimumSamples: Determines the minimum number of data points we want in a neighborhood to define a cluster.
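The two parameters map directly onto sklearn's `eps` and `min_samples` arguments. A minimal sketch on synthetic data (the specific `eps`/`min_samples` values here are illustrative, not the notebook's):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X = StandardScaler().fit_transform(X)  # eps is a distance, so scale first

# eps = radius of the neighborhood; min_samples = points needed inside it
# for a point to count as a core sample. Points in no dense area get label -1.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```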

In [*]:

It is possible to see that DBSCAN doesn't perform well here. The density in the data is not strong enough (we can see many points labeled -1, i.e. noise).

Mean Shift Algorithm

A centroid-based algorithm whose goal is to find blobs in a smooth density of samples. It works as follows:

Candidate centroids are updated to be the mean of the points within a given region. Those candidates are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centroids. The algorithm sets the number of clusters automatically, relying instead on a parameter bandwidth, which dictates the size of the region to search through.

The algorithm is not highly scalable, as it requires multiple nearest neighbor searches during execution. It is guaranteed to converge; however, it stops iterating when the change in centroids is small.
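The bandwidth-driven procedure above can be sketched as below; `estimate_bandwidth` picks a bandwidth from the data instead of hand-tuning it (the `quantile` value here is illustrative):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

# bandwidth dictates the size of the region each centroid candidate
# averages over.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=42)
ms = MeanShift(bandwidth=bandwidth).fit(X)

# The number of clusters is inferred from the data, not passed in.
n_found = ms.cluster_centers_.shape[0]
```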

In [ ]:

Summary

Most of the algorithms use 5 clusters (except DBSCAN). K-Means (k=5), AgglomerativeClustering and Mean Shift perform really well at clustering the instances, while DBSCAN does not. DBSCAN is not performing well because the data is not dense enough for this algorithm.

In [ ]:

Plot Statistics

In [ ]:
In [ ]: